Abstract

Unsupervised clustering of scattered, noisy and high-dimensional data points is an important and
difficult problem. Tight continuous relaxations of balanced cut problems have recently been shown to
provide excellent clustering results. In this paper, we present an explicit-implicit gradient flow scheme
for the relaxed ratio cut problem, and prove that the algorithm converges to a critical point of the energy.
We also show the efficiency of the proposed algorithm on the two moons dataset.

1 Introduction 

Partitioning data points into sensible groups is a fundamental problem in machine learning and has a wide 
range of applications. An efficient approach to deal with this problem is to cast the data partitioning 
problem as a graph clustering problem. Given a set of data points $V = \{x_1, \dots, x_n\}$ and similarity
weights $\{w_{i,j}\}_{1 \le i,j \le n}$, the clustering problem aims at finding a balanced cut of the graph of the data.
In this work, we consider the balanced cut of Hagen and Kahng [5] known as ratio cut. The ratio cut
problem is

$$\text{Minimize} \quad \mathrm{RatioCut}(S) = \frac{\sum_{x_i \in S,\, x_j \in S^c} w_{i,j}}{|S|} + \frac{\sum_{x_i \in S,\, x_j \in S^c} w_{i,j}}{|S^c|} \tag{1}$$

over all subsets $S \subset V$.

Here $|S|$ denotes the number of data points in $S$. While the problem, as stated above, is NP-hard, it has
the following tight continuous relaxation:

$$\text{Minimize} \quad E(f) = \frac{\frac{1}{2} \sum_{i,j} w_{i,j} |f_i - f_j|}{\sum_i |f_i - m(f)|} \tag{2}$$

over all non-constant functions $f : V \to \mathbb{R}$.

Here $m(f)$ stands for the average of $f \in \mathbb{R}^n$ and $f_i$ stands for $f(x_i)$. Recently, various algorithms have
been proposed [12, 6, 7, 1, 10] to minimize relaxations of balanced cut problems similar to (2). In this work,
we present an explicit-implicit gradient flow algorithm, then prove that the iterates converge to critical
points of the energy. We also present numerical experiments to show the robustness and efficiency of the
algorithm.
1.1 The Tight Continuous Relaxation

We begin by explaining the meaning of the term tight relaxation. Since $E$ is invariant under the
addition of a constant, problem (2) is equivalent to

$$\text{Minimize} \quad \frac{\frac{1}{2} \sum_{i,j} w_{i,j} |f_i - f_j|}{\sum_i |f_i|} \tag{3}$$

over all $f : V \to \mathbb{R}$ s.t. $m(f) = 0$ and $f \neq 0$.

If the graph is connected then the total variation functional $\frac{1}{2} \sum_{i,j} w_{i,j} |f_i - f_j|$ defines a norm on the
space of mean zero functions; we denote it by $\|f\|_{TV}$. The denominator of (3) is simply the $\ell^1$-norm,
and we denote it by $\|f\|_1$.

The continuous problem (3) is a tight relaxation of (1) in the following sense: if $S^*$ is a solution of
(1), then any nonzero, binary function of mean zero

$$f(x_i) = \begin{cases} a & \text{if } x_i \in S^* \\ b & \text{if } x_i \in (S^*)^c \end{cases} \tag{4}$$

is a solution of problem (3). This is a consequence of the fact that the extreme points of the TV-unit
ball

$$\{ f \in \mathbb{R}^n : \|f\|_{TV} \le 1,\ m(f) = 0 \}$$

are binary functions (see [12] for a proof of this fact). Therefore, if we fix $\|f\|_{TV} = 1$ and maximize the
convex functional in the denominator of (3), the minimum of the ratio is attained at an extreme point,
that is, at a binary function of mean zero. Binary functions of mean zero are always of the form

$$f = \lambda \left( |S^c|\, \chi_S - |S|\, \chi_{S^c} \right), \qquad S \subset V, \quad \lambda \neq 0,$$

where $\chi_S$ is the characteristic function of the set $S$. For such a function, we easily check that $E(f) =
\mathrm{RatioCut}(S)/2$. From this observation we can see that if $S^*$ is a solution of the ratio cut problem (1),
then $f^* = \lambda \left( |(S^*)^c|\, \chi_{S^*} - |S^*|\, \chi_{(S^*)^c} \right)$ is a solution of the continuous relaxation (2) for any $\lambda \neq 0$. A
different proof of the fact that problem (2) is a tight relaxation of problem (1) can be found in [10].
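The identity $E(f) = \mathrm{RatioCut}(S)/2$ for binary mean-zero functions can be checked numerically. The following is a minimal NumPy sketch, assuming a dense symmetric weight matrix `W` with zero diagonal; the helper names `ratio_cut` and `energy`, the random graph and the chosen subset are illustrative and not taken from the paper.

```python
import numpy as np

def ratio_cut(W, in_S):
    """RatioCut(S) = cut(S, S^c)/|S| + cut(S, S^c)/|S^c| for a boolean mask in_S."""
    cut = W[np.ix_(in_S, ~in_S)].sum()
    return cut / in_S.sum() + cut / (~in_S).sum()

def energy(W, f):
    """E(f) = (1/2 sum_ij w_ij |f_i - f_j|) / sum_i |f_i - m(f)|."""
    tv = 0.5 * (W * np.abs(f[:, None] - f[None, :])).sum()
    return tv / np.abs(f - f.mean()).sum()

# Hypothetical small graph: random symmetric weights with zero diagonal.
rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n))
W = np.triu(A, 1) + np.triu(A, 1).T

# Binary mean-zero function f = lam * (|S^c| chi_S - |S| chi_{S^c}).
in_S = np.array([True, True, True, False, False, False, False, False])
lam = -1.7
f = lam * ((~in_S).sum() * in_S - in_S.sum() * (~in_S))

print(energy(W, f), ratio_cut(W, in_S) / 2)   # the two values should agree
```

Any nonzero value of `lam` gives the same energy, since $E$ is invariant under scaling.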

1.2 Explicit-Implicit Gradient Flow

Let

$$T(f) = \|f\|_{TV} \quad \text{and} \quad B(f) = \sum_i |f_i - m(f)|. \tag{5}$$

Note that both $T$ and $B$ are convex. If $T$ and $B$ were differentiable, the explicit-implicit gradient flow
of $E = T/B$ would be

$$\frac{f^{k+1} - f^k}{\tau^k} = -\frac{\nabla T(f^{k+1}) - E(f^k)\, \nabla B(f^k)}{B(f^k)}, \tag{6}$$

where $\tau^k$ is the time step. Since $T$ and $B$ are not differentiable, we replace (6) with its non-smooth
equivalent:

$$g^k = f^k + \frac{\tau^k}{B(f^k)}\, E(f^k)\, v^k \quad \text{for some } v^k \in \partial B(f^k) \tag{7}$$

$$f^{k+1} = \arg\min_f \left\{ T(f) + B(f^k)\, \frac{\|f - g^k\|_2^2}{2\tau^k} \right\}. \tag{8}$$

The minimization problem (8) is a standard ROF problem [11] that can be solved efficiently using
approaches such as the augmented Lagrangian method [4] or the primal-dual method [3]. The scheme (7)-(8),
as will be shown in the next section, decreases the energy and preserves the zero mean property of
the successive iterates. In order to remain away from the origin, where the energy is not defined, we
project each iterate onto the sphere $S^{n-1} = \{u \in \mathbb{R}^n : \|u\|_2 = 1\}$ at the end of each step. In numerical
experiments we observe faster convergence when the time step is chosen to be

$$\tau^k = c\, \frac{B(f^k)}{E(f^k)}. \tag{9}$$
With these choices, we arrive at our proposed algorithm to find critical points of the ratio cut
functional (2):

$$g^k = f^k + c\, v^k \quad \text{for some } v^k \in \partial B(f^k) \tag{10}$$

$$h^k = \arg\min_f \left\{ T(f) + E(f^k)\, \frac{\|f - g^k\|_2^2}{2c} \right\} \tag{11}$$

$$f^{k+1} = \frac{h^k}{\|h^k\|_2} \tag{12}$$

which we formalize in Algorithm 1.
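The passage from (7)-(8) to (10)-(11) is a direct substitution of the time step. The short computation below is not spelled out in the text; it assumes the time step $\tau^k = c\,B(f^k)/E(f^k)$ with $c > 0$ and uses only the definitions above.

```latex
\begin{align*}
\tau^k = c\,\frac{B(f^k)}{E(f^k)}
  \quad&\Longrightarrow\quad
  \frac{\tau^k}{B(f^k)}\,E(f^k) = c
  \quad\Longrightarrow\quad
  g^k = f^k + c\,v^k, \\
\frac{B(f^k)}{2\tau^k} = \frac{B(f^k)\,E(f^k)}{2c\,B(f^k)} = \frac{E(f^k)}{2c}
  \quad&\Longrightarrow\quad
  h^k = \arg\min_f \Big\{ T(f) + E(f^k)\,\frac{\|f-g^k\|_2^2}{2c} \Big\}.
\end{align*}
```

In particular, the weight on the quadratic term is $\lambda^k/2$ with $\lambda^k = E(f^k)/c$.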



Algorithm 1 Steepest descent of the RatioCut functional (2)
  $f^{k=0}$ nonzero function with mean zero,
  $c$ positive constant,
  while loop not converged do
    $w^k \in \mathrm{sign}(f^k)$, $\quad v^k = w^k - m(w^k)$, $\quad \lambda^k = E(f^k)/c$
    $g^k = f^k + c\, v^k$
    $h^k = \arg\min_f \left\{ \|f\|_{TV} + \frac{\lambda^k}{2} \|f - g^k\|_2^2 \right\}$
    $f^{k+1} = h^k / \|h^k\|_2$
  end while



Let $\{f^k\}$ denote a sequence of iterates generated by Algorithm 1, starting from a non-zero function
$f^0$ with $m(f^0) = 0$. In Section 2, we show that any accumulation point of this sequence is a critical point
of the ratio cut functional (2). Moreover, we show that $\|f^{k+1} - f^k\|_2 \to 0$ as $k \to \infty$, so that either
the sequence converges or the set of accumulation points is a connected subset of the sphere $S^{n-1}$. In
Section 3 we demonstrate the efficiency of Algorithm 1 on the two moons example.
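Algorithm 1 can be sketched in a few lines of NumPy. The version below is an illustration under stated assumptions, not the authors' implementation: the inner ROF problem is solved approximately by projected gradient ascent on its dual (a simple stand-in for the augmented Lagrangian or primal-dual solvers mentioned above), fixed iteration counts replace a convergence test, `np.sign` picks the value 0 at zero entries, and all function names and the toy graph are hypothetical.

```python
import numpy as np

def energy(W, f):
    """E(f) = T(f) / B(f) with T(f) = 1/2 sum_ij w_ij |f_i - f_j|."""
    tv = 0.5 * (W * np.abs(f[:, None] - f[None, :])).sum()
    return tv / np.abs(f - f.mean()).sum()

def edge_operator(W):
    """Matrix K with one row per edge i < j, so that T(f) = ||K f||_1."""
    i, j = np.nonzero(np.triu(W, 1))
    K = np.zeros((len(i), W.shape[0]))
    K[np.arange(len(i)), i] = W[i, j]
    K[np.arange(len(i)), j] = -W[i, j]
    return K

def rof_step(K, g, lam, n_inner=500):
    """Approximate argmin_u ||K u||_1 + lam/2 ||u - g||_2^2 by projected
    gradient ascent on the dual variable p in [-1, 1]^m, with u = g - K^T p / lam."""
    step = lam / np.linalg.norm(K, 2) ** 2   # 1 / Lipschitz constant of the dual gradient
    p = np.zeros(K.shape[0])
    for _ in range(n_inner):
        u = g - K.T @ p / lam
        p = np.clip(p + step * (K @ u), -1.0, 1.0)
    return g - K.T @ p / lam

def ratio_cut_descent(W, f0, c=0.25, n_outer=30):
    """Sketch of the outer loop of Algorithm 1; returns the last iterate and the energies."""
    K = edge_operator(W)
    f = f0 - f0.mean()
    f = f / np.linalg.norm(f)
    history = [energy(W, f)]
    for _ in range(n_outer):
        w = np.sign(f)                          # one element of sign(f^k)
        v = w - w.mean()                        # v^k = w^k - m(w^k)
        g = f + c * v                           # g^k = f^k + c v^k
        h = rof_step(K, g, history[-1] / c)     # lambda^k = E(f^k) / c
        f = h / np.linalg.norm(h)               # projection onto the sphere
        history.append(energy(W, f))
    return f, history

# Toy graph (an assumption for illustration): two dense groups, weakly connected.
n = 12
W = np.full((n, n), 0.05)
W[:6, :6] = 1.0
W[6:, 6:] = 1.0
np.fill_diagonal(W, 0.0)
f0 = np.random.default_rng(0).standard_normal(n)
f, history = ratio_cut_descent(W, f0)
print(history[0], history[-1])
```

Because the inner problem is only solved approximately, the strict monotone decrease guaranteed for exact minimizers may hold only up to the inner solver's accuracy; a thresholding of the final `f` at zero gives the candidate partition.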



2 Convergence 

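Given a connected graph, we want to minimize

$$E(f) = \frac{\frac{1}{2} \sum_{i,j=1}^n w_{i,j} |f_i - f_j|}{\sum_{i=1}^n |f_i - m(f)|} = \frac{T(f)}{B(f)}$$

over the space of non-constant functions $f \in \mathbb{R}^n$. (Note that $E$ is not defined for constant functions.)
This is equivalent to minimizing $E$ over the set of non-constant functions with mean zero, which we write
as

$$\mathcal{F} = \{ f \in \mathbb{R}^n : m(f) = 0 \text{ and } f \neq 0 \}.$$

We define $\mathbf{1} := (1, \dots, 1)^T \in \mathbb{R}^n$, so that $m(f) = \langle \mathbf{1}, f \rangle / n$ and $\mathbf{1}^\perp$ gives the space of functions with mean
zero. Clearly $\mathcal{F}$ is an open subset of $\mathbf{1}^\perp$. As we assume a connected graph, $T$ and $B$ define norms on
$\mathbf{1}^\perp$. Since all norms are equivalent in finite dimensions, there exist constants $\beta > \alpha > 0$ such that

$$\alpha B(f) \le T(f) \le \beta B(f) \quad \text{for all } f \in \mathbf{1}^\perp.$$

Therefore

$$\alpha \le E(f) \le \beta \quad \text{for all } f \in \mathcal{F}.$$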

If we let

$$L(f) = \|f\|_1 = \sum_{i=1}^n |f_i|, \qquad P_0 f = f - m(f)\, \mathbf{1}, \tag{13}$$

then we see that $B(f) = L(P_0 f)$. Note that $P_0 = \mathrm{Id} - \frac{1}{n} \mathbf{1}\mathbf{1}^T$, so that the matrix $P_0$ simply gives the
orthogonal projection onto $\mathbf{1}^\perp$. As $L(f)$ is convex, so is $B(f) = L(P_0 f)$, and we also have

$$\partial B(f) = P_0\, \mathrm{sign}(P_0 f).$$

It is then easy to see that $\langle \partial B(f), \mathbf{1} \rangle = 0$ for all $f$. If $f \in \mathbf{1}^\perp$, then $B(f) = L(f)$ and $\partial B(f)$ is simply
the projection of $\partial L(f)$ on $\mathbf{1}^\perp$, i.e. $\partial B(f) = P_0\, \mathrm{sign}(f)$.
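The formula $\partial B(f) = P_0\, \mathrm{sign}(P_0 f)$ can be sanity-checked numerically. The sketch below is illustrative only: it picks the single subgradient obtained with NumPy's convention `sign(0) = 0`, and the helper names `B` and `subgrad_B` are not from the paper. It checks orthogonality to $\mathbf{1}$, the identity $\langle v, f \rangle = B(f)$ (which follows from $B$ being homogeneous of degree one) and the subgradient inequality at a random point.

```python
import numpy as np

def B(f):
    """B(f) = sum_i |f_i - m(f)|."""
    return np.abs(f - f.mean()).sum()

def subgrad_B(f):
    """One element of dB(f) = P0 sign(P0 f), using sign(0) = 0."""
    w = np.sign(f - f.mean())
    return w - w.mean()

rng = np.random.default_rng(1)
f = rng.standard_normal(10)
u = rng.standard_normal(10)
v = subgrad_B(f)

print(v.sum())                            # <v, 1>, should vanish up to rounding
print(v @ f - B(f))                       # <v, f> - B(f), should vanish up to rounding
print(B(u) - (B(f) + v @ (u - f)))        # subgradient inequality gap, should be >= 0
```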

Starting from a non-constant function $f$, we define $g$ and $h$ according to Algorithm 1:

$$g = f + c\, v, \quad \text{where } v \in \partial B(f) \tag{14}$$

$$h = \arg\min_u \left\{ T(u) + E(f)\, \frac{\|u - g\|_2^2}{2c} \right\}, \tag{15}$$

which we write succinctly as

$$h \in \mathcal{H}_c(f).$$

Since $g$ is not uniquely defined when $B$ is not differentiable at $f$, in general $\mathcal{H}_c(f)$ may have more than
one element. Therefore the map $\mathcal{H}_c$ is a set-valued map defined over the space of non-constant functions
(see Definition 2 in Section 2.2).

2.1 Estimates 

Lemma 1 (Elementary properties of $\mathcal{H}_c$). Let $g$ and $h$ be defined by (14)-(15).

1. If $f$ is not constant, then $h$ is not constant. Moreover, the energy inequality

$$E(f) \ge E(h) + \frac{E(f)}{B(h)}\, \frac{\|h - f\|_2^2}{c} \tag{16}$$

holds. As a consequence, $E(h) < E(f)$ unless $h = f$.

2. If $f$ is not constant, then

$$\|h\|_2 \le \|g\|_2 \le \|f\|_2 + 2c\sqrt{n}. \tag{17}$$

3. If $f \in \mathbb{R}^n$, then $\|g\|_2 \ge \|f\|_2$, or, to be more precise:

$$\|g\|_2^2 = \|f\|_2^2 + 2c\, B(f) + c^2 \|\partial B(f)\|_2^2.$$

4. If $f \in \mathcal{F}$, then $g, h \in \mathcal{F}$.
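Proof. (1.) The definition (15) of $h$ implies that $E(f) \frac{h - g}{c} \in -\partial T(h)$, and therefore, since $T$ is convex,

$$T(f) \ge T(h) + \left\langle -E(f)\, \frac{h - g}{c},\ f - h \right\rangle \tag{18}$$

$$= T(h) - E(f) \left\langle \frac{h - f - c\, v}{c},\ f - h \right\rangle \tag{19}$$

$$= T(h) + \frac{E(f)}{c} \|h - f\|_2^2 - E(f) \langle v, h - f \rangle. \tag{20}$$

Since $B$ is also convex, we have $B(h) \ge B(f) + \langle v, h - f \rangle$. Multiplying this inequality by $E(f)$ and
adding it to the previous one, we obtain

$$T(f) + E(f) B(h) \ge T(h) + E(f) B(f) + \frac{E(f)}{c} \|h - f\|_2^2.$$

In other words, since $T(f) = E(f) B(f)$,

$$E(f) B(h) \ge T(h) + \frac{E(f)}{c} \|h - f\|_2^2.$$

Since $f$ is not constant, we have $E(f) > 0$. Note that if $h$ were constant, then $B(h) = 0$, which would
imply $h = f$. This is a contradiction since $f$ is not constant. Thus $B(h) > 0$, so we may divide in the
last expression to obtain (16).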



(2.) To prove that $\|h\|_2 \le \|g\|_2$, note

$$h = \mathrm{prox}_\Phi(g) := \arg\min_u \left\{ \Phi(u) + \frac{\|u - g\|_2^2}{2} \right\} \quad \text{where } \Phi(u) = \frac{c}{E(f)}\, T(u).$$

Since proximal mappings are Lipschitz continuous with constant one, and since $\mathrm{prox}_\Phi(0) = 0$, we have

$$\|h\|_2 = \|\mathrm{prox}_\Phi(g) - \mathrm{prox}_\Phi(0)\|_2 \le \|g\|_2. \tag{21}$$

To establish the inequality $\|g\|_2 \le \|f\|_2 + 2c\sqrt{n}$, note that $\|\mathrm{sign}(P_0 f)\|_\infty \le 1$ and therefore

$$\|\partial B(f)\|_2 \le \sqrt{n}\, \|\partial B(f)\|_\infty = \sqrt{n}\, \|\mathrm{sign}(P_0 f) - m(\mathrm{sign}(P_0 f))\, \mathbf{1}\|_\infty \le 2\sqrt{n} \quad \text{for all } f \in \mathbb{R}^n. \tag{22}$$

The upper bound then follows from the definition of $g$ and the triangle inequality.

(3.) Since $B$ is homogeneous of degree one, we have

$$\|g\|_2^2 = \|f + c\, \partial B(f)\|_2^2 = \|f\|_2^2 + 2c \langle f, \partial B(f) \rangle + c^2 \|\partial B(f)\|_2^2 = \|f\|_2^2 + 2c\, B(f) + c^2 \|\partial B(f)\|_2^2. \tag{23}$$

(4.) Since $\partial B(f) \subset \mathbf{1}^\perp$, it is clear that $f \in \mathbf{1}^\perp$ implies $g \in \mathbf{1}^\perp$. Equation (23) shows that $\|g\|_2 \ge \|f\|_2 > 0$,
so that $g$ cannot be constant (the only constant function of mean zero is the zero function). Thus $g \in \mathcal{F}$.
Suppose that $h \notin \mathbf{1}^\perp$. Since $P_0$ projects onto $\mathbf{1}^\perp$ and since $T(P_0 u) = T(u)$ for all $u \in \mathbb{R}^n$ (because $T$ is
invariant under addition of a constant), we have

$$T(h) + \frac{E(f)}{2c} \|h - g\|_2^2 = T(P_0 h) + \frac{E(f)}{2c} \left( \|P_0 h - g\|_2^2 + \|(\mathrm{Id} - P_0) h\|_2^2 \right).$$

This contradicts the definition of $h$ as the global minimizer unless $(\mathrm{Id} - P_0) h = 0$. Thus $h$ has mean zero.
By property (1.) we know $h$ is not constant, so $h \in \mathcal{F}$ as well. □
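The norm identity (23) and the bounds (22) and (17) on $g$ are easy to verify numerically for a particular subgradient. The sketch below is only a sanity check on a random vector; the dimension, the constant `c` and the use of `np.sign` (which selects 0 at zero entries) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 50, 0.25
f = rng.standard_normal(n)

B_f = np.abs(f - f.mean()).sum()      # B(f)
w = np.sign(f - f.mean())             # element of sign(P0 f)
v = w - w.mean()                      # element of dB(f)
g = f + c * v                         # equation (14)

# Identity (23): ||g||^2 = ||f||^2 + 2 c B(f) + c^2 ||v||^2
lhs = np.linalg.norm(g) ** 2
rhs = np.linalg.norm(f) ** 2 + 2 * c * B_f + c ** 2 * np.linalg.norm(v) ** 2
print(lhs - rhs)                      # should vanish up to rounding

# Bound (22) and the upper bound on ||g||_2 in (17)
print(np.linalg.norm(v) <= 2 * np.sqrt(n))
print(np.linalg.norm(g) <= np.linalg.norm(f) + 2 * c * np.sqrt(n))
```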

Definition 1. Let $f^0 \in \mathcal{F}$. We say that $\{f^k, g^k, h^k\}$ is a sequence generated by the algorithm if

$$f^{k+1} \in P_2(\mathcal{H}_c(f^k)), \quad \text{where } P_2 \text{ is the projection onto the sphere } S^{n-1},$$

and where $g^k$ and $h^k$ are defined from $f^k$ by (14) and (15).

Lemma 2 (Properties of the iterates). If $\{f^k, g^k, h^k\}$ is a sequence generated by the algorithm, then
$E(f^{k+1}) \le E(f^k)$ with equality if and only if $f^k = f^{k+1}$. Moreover,

$$\|f^k - h^k\|_2 \to 0 \quad \text{and} \quad \|f^k - f^{k+1}\|_2 \to 0. \tag{24}$$

Therefore $S^{n-1}$ is an attractor for the sequence $\{h^k\}$.

Proof. The fact that the energy decreases is a consequence of (16) from Lemma 1 together with the fact
that $E(f^{k+1}) = E(h^k)$ due to the invariance of $E$ under scaling. As $f^k \in \mathbf{1}^\perp$ and $\|f^k\|_2 = 1$ it follows
that $E(f^k) \ge \alpha > 0$. From (16) we then have

$$\|h^k - f^k\|_2^2 \le \frac{c}{\alpha}\, B(h^k) \left( E(f^k) - E(f^{k+1}) \right). \tag{25}$$

Now from (17) we have

$$B(h^k) = \|h^k\|_1 \le \sqrt{n}\, \|h^k\|_2 \le \sqrt{n} + 2nc,$$

and therefore

$$\|h^k - f^k\|_2^2 \le \frac{c}{\alpha} \left( \sqrt{n} + 2nc \right) \left( E(f^k) - E(f^{k+1}) \right) \to 0,$$

where we have used that $E(f^k)$ is a converging sequence since it is decreasing and bounded from below.

We now show $\|f^k - f^{k+1}\|_2 \to 0$. Note that the projection $P_2$ is smooth on the annulus $A := \{u \in
\mathbb{R}^n : 1/2 \le \|u\|_2 \le 3/2\}$ and therefore it is Lipschitz continuous on $A$ with constant, say, $C$. Since
eventually $h^k \in A$, we have

$$\|f^k - f^{k+1}\|_2 = \|P_2(f^k) - P_2(h^k)\|_2 \le C \|f^k - h^k\|_2 \to 0.$$

□
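The Lipschitz constant $C$ of the projection $P_2$ on the annulus is left unspecified in the proof. One admissible value is $C = 2$: for $\|u\|_2 \ge 1/2$ the point $u/\|u\|_2$ is the projection of $2u$ onto the closed unit ball, and that projection is nonexpansive. The following sketch, with an illustrative dimension and sample size, checks this bound empirically on random pairs of points in the annulus.

```python
import numpy as np

def project_to_sphere(u):
    """P2(u) = u / ||u||_2, defined for u != 0."""
    return u / np.linalg.norm(u)

rng = np.random.default_rng(2)
n, trials = 5, 10000
worst = 0.0
for _ in range(trials):
    u, v = rng.standard_normal((2, n))
    # Rescale both points so that their norms lie in [1/2, 3/2].
    u *= rng.uniform(0.5, 1.5) / np.linalg.norm(u)
    v *= rng.uniform(0.5, 1.5) / np.linalg.norm(v)
    ratio = (np.linalg.norm(project_to_sphere(u) - project_to_sphere(v))
             / np.linalg.norm(u - v))
    worst = max(worst, ratio)

print(worst)   # largest observed ratio, should not exceed 2
```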
2.2 Proof of convergence

Definition 2 (Set-valued map). Let $X$ and $Y$ be two subsets of $\mathbb{R}^n$. If for each $x \in X$ there is a
corresponding set $F(x) \subset Y$ then $F$ is called a set-valued map from $X$ to $Y$. We denote this by
$F : X \rightrightarrows Y$. The graph of $F$, denoted $\mathrm{Graph}(F)$, is defined by

$$\mathrm{Graph}(F) = \{ (x, y) \in \mathbb{R}^n \times \mathbb{R}^n : y \in F(x),\ x \in X \}.$$

A set-valued map $F$ is called closed if $\mathrm{Graph}(F)$ is a closed subset of $\mathbb{R}^n \times \mathbb{R}^n$.
Define the compact sets

$$K_1 = \{ u \in \mathbb{R}^n : \|u\|_2 = 1 \text{ and } m(u) = 0 \} \tag{26}$$

$$K_2 = \{ u \in \mathbb{R}^n : 1 \le \|u\|_2 \le 1 + 2c\sqrt{n} \text{ and } m(u) = 0 \} \tag{27}$$

along with the set-valued map $\mathcal{Y}_c : K_1 \rightrightarrows K_2$

$$\mathcal{Y}_c(f) = f + c\, \partial B(f).$$

The fact that the range of $\mathcal{Y}_c$ is in $K_2$ is a consequence of (17) together with properties 3 and 4 of Lemma 1.
Lemma 3. The set-valued map $\mathcal{Y}_c$ is closed.

Proof. Let us first show that the set-valued map $\mathrm{sign} : \mathbb{R}^n \rightrightarrows [-1, 1]^n$ is closed. Assume that

$$f^k \to f^* \tag{28}$$

$$z^k \in \mathrm{sign}(f^k) \to z^*. \tag{29}$$

We want to show that $z^* \in \mathrm{sign}(f^*)$, or equivalently, $z_i^* \in \mathrm{sign}(f_i^*)$ for all $1 \le i \le n$. If $f_i^* > 0$ then
$f_i^k > 0$ for $k$ large enough. As $z_i^k = 1$ for all such $k$ it follows that $z_i^* = 1 = \mathrm{sign}(f_i^*)$. Similar reasoning
applies if $f_i^* < 0$. Lastly, if $f_i^* = 0$ then $\mathrm{sign}(f_i^*) = [-1, 1]$. The entire sequence $\{z_i^k\}_{k=1}^\infty$ therefore lies in
the closed set $\mathrm{sign}(f_i^*)$, so $z_i^* \in \mathrm{sign}(f_i^*)$ as well.
To show that $\mathcal{Y}_c$ is closed, assume that

$$f^k \to f^* \tag{30}$$

$$g^k \in \mathcal{Y}_c(f^k) = f^k + c P_0\, \mathrm{sign}(f^k) \to g^*, \tag{31}$$

where we have used the fact that $\partial B(f) = P_0\, \mathrm{sign}(f)$ whenever $f \in K_1$. Thus our goal is to prove that
$g^* \in \mathcal{Y}_c(f^*)$. Clearly there exists $z^k \in \mathrm{sign}(f^k)$ such that

$$g^k = f^k + c P_0 z^k. \tag{32}$$

Since $z^k$ lies in a compact set there exists a subsequence $z^{k_i} \to z^*$. So we have

$$f^{k_i} \to f^* \tag{33}$$

$$z^{k_i} \in \mathrm{sign}(f^{k_i}) \to z^*. \tag{34}$$

Since sign is closed, $z^* \in \mathrm{sign}(f^*)$, which combined with (32) gives

$$g^{k_i} = f^{k_i} + c P_0 z^{k_i} \to f^* + c P_0 z^* \in \mathcal{Y}_c(f^*),$$

where we have used the definition of $\mathcal{Y}_c(f^*)$ and the fact that $f^* \in K_1$. From (31) we then obtain
$g^* \in \mathcal{Y}_c(f^*)$ as desired. □

We define the function $\Psi_c : K_1 \times K_2 \to \mathbb{R}^n$

$$\Psi_c(f, g) = \arg\min_u \left\{ T(u) + E(f)\, \frac{\|u - g\|_2^2}{2c} \right\}.$$

Lemma 4. The function $\Psi_c$ is continuous on $K_1 \times K_2$.
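Proof. Let $h = \Psi_c(f, g)$ and $h' = \Psi_c(f', g')$. Then we have $E(f) \frac{h - g}{c} \in -\partial T(h)$ and $E(f') \frac{h' - g'}{c} \in
-\partial T(h')$, so

$$T(h') \ge T(h) - \left\langle E(f)\, \frac{h - g}{c},\ h' - h \right\rangle$$

$$T(h) \ge T(h') - \left\langle E(f')\, \frac{h' - g'}{c},\ h - h' \right\rangle.$$

By adding these two inequalities,

$$\langle E(f)(h - g) - E(f')(h' - g'),\ h - h' \rangle \le 0.$$

Adding and subtracting $E(f)(h' - g')$ we get

$$\langle E(f)(h - g) - E(f)(h' - g'),\ h - h' \rangle + \langle (E(f) - E(f'))(h' - g'),\ h - h' \rangle \le 0$$

$$E(f) \langle (h - h') - (g - g'),\ h - h' \rangle + (E(f) - E(f')) \langle h' - g',\ h - h' \rangle \le 0$$

$$E(f) \left( \|h - h'\|_2^2 - \langle g - g',\ h - h' \rangle \right) + (E(f) - E(f')) \langle h' - g',\ h - h' \rangle \le 0$$

$$\|h - h'\|_2^2 \le \langle g - g',\ h - h' \rangle - \frac{E(f) - E(f')}{E(f)} \langle h' - g',\ h - h' \rangle.$$

From Cauchy-Schwarz we have

$$\|h - h'\|_2 \le \|g - g'\|_2 + \frac{|E(f) - E(f')|}{E(f)} \|h' - g'\|_2 \le \|g - g'\|_2 + 2\, \frac{|E(f) - E(f')|}{E(f)} \|g'\|_2.$$

The last inequality follows from (21). We then easily conclude that if $(f', g') \to (f, g)$ then $h' \to h$, due
to the continuity of $E$ on $K_1$. □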

We next show that the set-valued map $\mathcal{H}_c : K_1 \rightrightarrows \mathcal{F}$

$$\mathcal{H}_c(f) = \Psi_c(f, \mathcal{Y}_c(f))$$

is closed. The fact that the range of $\mathcal{H}_c$ is in $\mathcal{F}$ is a consequence of Lemma 1.
Lemma 5. The set-valued map $\mathcal{H}_c$ is closed.

Proof. Suppose that

$$f^k \to f^* \tag{35}$$

$$h^k \in \mathcal{H}_c(f^k) = \Psi_c(f^k, \mathcal{Y}_c(f^k)) \to h^*. \tag{36}$$

We must show that $h^* \in \mathcal{H}_c(f^*)$. Clearly there exist $g^k \in \mathcal{Y}_c(f^k)$ such that

$$h^k = \Psi_c(f^k, g^k).$$

Since the sequence $\{g^k\}$ is in the compact set $K_2$ there exists $g^* \in K_2$ and a subsequence $g^{k_i} \to g^*$. So we
have

$$f^{k_i} \to f^* \tag{37}$$

$$g^{k_i} \in \mathcal{Y}_c(f^{k_i}) \to g^*, \tag{38}$$

from which we conclude that $g^* \in \mathcal{Y}_c(f^*)$ because $\mathcal{Y}_c$ is closed. Now since $\Psi_c$ is continuous we have

$$h^{k_i} = \Psi_c(f^{k_i}, g^{k_i}) \to \Psi_c(f^*, g^*) \in \Psi_c(f^*, \mathcal{Y}_c(f^*)) = \mathcal{H}_c(f^*).$$

But $h^{k_i} \to h^*$, so we may conclude $h^* \in \mathcal{H}_c(f^*)$ as desired. □
Definition 3 (Critical points). Let $f \in \mathcal{F}$. We say that $f$ is a critical point of the energy $E(f)$ if there
exist $w \in \partial T(f)$ and $v \in \partial B(f)$ so that

$$0 = w - E(f)\, v.$$

If both $T$ and $B$ are differentiable at $f$ then the subdifferentials $\partial T(f), \partial B(f)$ are single-valued, so we
recover the usual quotient rule

$$0 = \nabla T(f) - E(f)\, \nabla B(f).$$

Theorem 1 (Convergence of the algorithm). Take $f^0 \in \mathcal{F}$ and fix a constant $c > 0$. Let $\{f^k\}_{k=0}^\infty \subset \mathcal{F}$
be a sequence generated by the algorithm. Then

1. Any accumulation point $f^*$ of the sequence is a critical point of the energy.

2. Either the sequence converges, or the set of accumulation points is a connected subset of $S^{n-1}$.

Proof. (1.) The proof is inspired by [8]. Let $f^{k_i}$ denote a subsequence converging to $f^*$. Since the
sequence $\{f^{k_i + 1}\}_{i=1}^\infty$ lies in a compact set we can extract a further subsequence (still denoted $\{f^{k_i + 1}\}$)
that converges to some function $f'$. So we have, as $i \to \infty$,

$$f^{k_i} \to f^* \tag{39}$$

$$f^{k_i + 1} \to f'. \tag{40}$$

But, because of (24), it must be that $f^* = f'$. Thus we have

$$f^{k_i} \to f^* \tag{41}$$

$$f^{k_i + 1} \in P_2(\mathcal{H}_c(f^{k_i})) \to f^*. \tag{42}$$

Clearly there exist $h^{k_i} \in \mathcal{H}_c(f^{k_i})$ such that $f^{k_i + 1} = P_2(h^{k_i})$. Since the $h^{k_i}$ eventually lie in the annulus
$A := \{1/2 \le \|u\|_2 \le 3/2\}$, we can assume (upon extracting another subsequence) that $h^{k_i} \to
h^* \in A$. Therefore we have

$$f^{k_i} \to f^* \tag{43}$$

$$h^{k_i} \in \mathcal{H}_c(f^{k_i}) \to h^* \tag{44}$$

and since $\mathcal{H}_c$ is closed, $h^* \in \mathcal{H}_c(f^*)$. Since $P_2$ is continuous in the annulus $A$ and all limit points of $\{h^k\}$
lie on $S^{n-1}$, we conclude that

$$f^{k_i + 1} = P_2(h^{k_i}) \to P_2(h^*) = h^* \in \mathcal{H}_c(f^*).$$

From (42) we therefore have $f^* \in \mathcal{H}_c(f^*)$. By definition of $\mathcal{H}_c(f^*)$, if $f^* \in \mathcal{H}_c(f^*)$ then there exists
$g^* \in \mathcal{Y}_c(f^*)$ so that

$$f^* = \arg\min_u \left\{ T(u) + E(f^*)\, \frac{\|u - g^*\|_2^2}{2c} \right\}.$$

Therefore there exists $w^* \in \partial T(f^*)$ so that $0 = c\, w^* + E(f^*)(f^* - g^*)$. By definition of $\mathcal{Y}_c(f^*)$ there
exists $v^* \in \partial B(f^*)$ so that

$$0 = c\, w^* + E(f^*) \left( f^* - (f^* + c\, v^*) \right) = c \left( w^* - E(f^*)\, v^* \right).$$

Thus $f^*$ is a critical point of the energy according to Definition 3.

(2.) For any sequence generated by the algorithm, $\|f^{k+1} - f^k\|_2 \to 0$ according to Lemma 2.
Moreover, the iterates lie in the bounded set $S^{n-1} \subset \mathbb{R}^n$. The hypotheses of Theorem 26.1 of [9] are therefore
satisfied, giving the desired conclusion. □
Figure 1: Unsupervised clustering of the two moons dataset. Each moon has 1,000 data points in $\mathbb{R}^{100}$.
Panels: (a) Two moons dataset; (b) Desired clustering.

3 Experiments 

We construct the two moons dataset as in [2] (Figure 1). The first moon is a half circle of radius one
in $\mathbb{R}^2$, centered at the origin, sampled with a thousand points; the second moon is an upside down half
circle also sampled at a thousand points, but centered at $(1, -1/2)$. The dataset is embedded in $\mathbb{R}^{100}$
by adding Gaussian noise with $\sigma = 0.015$. In all experiments we use a 10 nearest neighbors graph with
the self-tuning weights as in [13] (the neighbor parameter in the self-tuning is set to 7 and the universal
scaling to 1). The constant $c$ in Algorithm 1 is taken to be $c = 1/4$.

Clustering results with different initial conditions are shown in Figure 2. Since the energy is not
convex there is no guarantee that the algorithm will converge toward the global minimizer of the ratio
cut functional. However, for most initial data, the algorithm indeed finds the correct solution in a
small number of iterations.
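The experimental setup above can be sketched as follows. This is an illustration under several assumptions that the text does not fix: angles are sampled uniformly at random on each half circle, $\sigma$ is read as the standard deviation of isotropic noise added in all 100 coordinates, the self-tuning weight is taken as $\exp(-d_{ij}^2 / (\sigma_i \sigma_j))$ with $\sigma_i$ the distance to the 7th neighbor, and the k-nearest-neighbor graph is symmetrized by a maximum. The function names are hypothetical, and the demo uses fewer points than the paper to keep it light.

```python
import numpy as np

def two_moons(n_per_moon=1000, dim=100, sigma=0.015, seed=0):
    """Two half circles of radius one embedded in R^dim with Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=(2, n_per_moon))
    upper = np.stack([np.cos(t[0]), np.sin(t[0])], axis=1)               # centered at (0, 0)
    lower = np.stack([1.0 + np.cos(t[1]), -0.5 - np.sin(t[1])], axis=1)  # flipped, centered at (1, -1/2)
    X = np.zeros((2 * n_per_moon, dim))
    X[:, :2] = np.vstack([upper, lower])
    X += sigma * rng.standard_normal(X.shape)
    labels = np.repeat([0, 1], n_per_moon)
    return X, labels

def self_tuning_knn_graph(X, k=10, k_scale=7):
    """Symmetric k-nearest-neighbor weight matrix with local scaling."""
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)   # squared distances
    np.fill_diagonal(D2, 0.0)
    order = np.argsort(D2, axis=1)                                    # column 0 is the point itself
    scale = np.sqrt(D2[np.arange(len(X)), order[:, k_scale]])         # distance to the 7th neighbor
    W = np.zeros_like(D2)
    rows = np.repeat(np.arange(len(X)), k)
    cols = order[:, 1:k + 1].ravel()
    W[rows, cols] = np.exp(-D2[rows, cols] / (scale[rows] * scale[cols]))
    return np.maximum(W, W.T)                                         # symmetrize

# Small demo with 200 points per moon instead of 1,000.
X, labels = two_moons(n_per_moon=200)
W = self_tuning_knn_graph(X)
print(X.shape, W.shape)
```

The resulting `W` can be passed to any implementation of Algorithm 1 that accepts a dense symmetric weight matrix; for the full 2,000-point dataset a sparse representation would be preferable.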

References 

[1] X. Bresson, X.-C. Tai, T.F. Chan, and A. Szlam. Multi-Class Transductive Learning based on $\ell^1$
Relaxations of Cheeger Cut and Mumford-Shah-Potts Model. UCLA CAM Report, 2012.

[2] T. Bühler and M. Hein. Spectral Clustering Based on the Graph p-Laplacian. In International
Conference on Machine Learning, pages 81-88, 2009.

[3] A. Chambolle and T. Pock. A First-Order Primal-Dual Algorithm for Convex Problems with
Applications to Imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

[4] T. Goldstein and S. Osher. The Split Bregman Method for L1-Regularized Problems. SIAM Journal
on Imaging Sciences, 2(2):323-343, 2009.

[5] L. Hagen and A. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE
Trans. Computer-Aided Design, 11:1074-1085, 1992.

[6] M. Hein and T. Bühler. An Inverse Power Method for Nonlinear Eigenproblems with Applications
in 1-Spectral Clustering and Sparse PCA. In Advances in Neural Information Processing Systems
(NIPS), pages 847-855, 2010.

[7] M. Hein and S. Setzer. Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts.
In Advances in Neural Information Processing Systems (NIPS), 2011.

[8] R.R. Meyer. Sufficient conditions for the convergence of monotonic mathematical programming
algorithms. Journal of Computer and System Sciences, 12(1):108-121, 1976.

[9] A. M. Ostrowski. Solution of Equations in Euclidean and Banach Spaces. Academic Press, New
York, 1973.

[10] S. Rangapuram and M. Hein. Constrained 1-Spectral Clustering. In International Conference on
Artificial Intelligence and Statistics (AISTATS), pages 1143-1151, 2012.



Figure 2: Outcomes of Algorithm 1 with different initial data. On the right column the value of the ratio
cut functional (2) is plotted versus the number of iterations.
Panels, one row per initialization: (a) Initialization #1 (2nd eigenvector of graph Laplacian), (b) Outcome of
Algorithm 1, (c) Energy w.r.t. iteration; (d) Initialization #2 (random init), (e) Outcome of Algorithm 1,
(f) Energy w.r.t. iteration; (g) Initialization #3 (random init), (h) Outcome of Algorithm 1, (i) Energy
w.r.t. iteration; (j) Initialization #4 (random init), (k) Outcome of Algorithm 1, (l) Energy w.r.t. iteration.